library(plotly)
library(data.table)
library(tidyr)
library(knitr)
library(heatmaply)

Preprocessing

  • Load data file
  • Rename genres for better readability
    • “Religion, Spirituality & New Age” to “Religion”
    • “Science.fiction” to “SciFi”
    • “Action.and.Adventure” to “Action”
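The renaming step could look roughly like the sketch below. It runs on a small toy data frame because the real file name and layout are not shown here. The raw column name of the Religion genre after import is unknown, so only the two dotted names listed above are mapped. The names `books`, `renames` and `genres` are illustrative.

```r
# Toy stand-in for the loaded co-purchase table (values are made up).
books <- data.frame(
  Satire = c(10, 2, 1),
  Science.fiction = c(2, 20, 5),
  Action.and.Adventure = c(1, 5, 30)
)

# Old column name -> short name (hypothetical helper vector).
renames <- c(Science.fiction = "SciFi", Action.and.Adventure = "Action")

# Replace only the columns that appear in the mapping.
hits <- names(books) %in% names(renames)
names(books)[hits] <- unname(renames[names(books)[hits]])

# Assumption: rows follow the same genre order as the columns.
rownames(books) <- names(books)

genres <- names(books)
print(genres)
```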

All genres:

##  [1] "Satire"        "SciFi"         "Drama"         "Action"       
##  [5] "Romance"       "Mystery"       "Horror"        "Self.help"    
##  [9] "Health"        "Guide"         "Travel"        "Children.s"   
## [13] "Religion"      "Science"       "History"       "Math"         
## [17] "Anthology"     "Poetry"        "Encyclopedias" "Dictionaries" 
## [21] "Comics"        "Art"           "Cookbooks"     "Diaries"      
## [25] "Journals"
  • Check whether the upper and lower triangles are identical (i.e. the matrix is symmetric)
## [1] TRUE
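One way to run such a check, shown on a made-up 3x3 matrix `m` (the original code is not visible, so this is only a sketch):

```r
# Toy symmetric co-purchase matrix (values are made up).
m <- matrix(c(10,  2,  1,
               2, 20,  5,
               1,  5, 30), nrow = 3, byrow = TRUE)

# The upper triangle of t(m) holds the lower-triangle values of m,
# so comparing both upper triangles compares upper vs. lower triangle of m.
triangles_match <- all(m[upper.tri(m)] == t(m)[upper.tri(m)])
print(triangles_match)
```

Base R's `isSymmetric(m)` performs an equivalent test for numeric matrices.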
  • Transform to long and tidy data.table
head(books_dt)
##     genreA genreB customers
## 1:  Satire Satire      3798
## 2:   SciFi Satire       423
## 3:   Drama Satire        19
## 4:  Action Satire       343
## 5: Romance Satire       505
## 6: Mystery Satire       227
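A possible way to get from the square matrix to this long layout with `data.table::melt`, sketched on toy values. The column names `genreA`, `genreB` and `customers` follow the output above; everything else is an assumption.

```r
library(data.table)

genres <- c("Satire", "SciFi", "Drama")
# Toy symmetric matrix; the counts are made up.
m <- matrix(c(30,  4,  1,
               4, 20,  5,
               1,  5, 10),
            nrow = 3, byrow = TRUE, dimnames = list(genres, genres))

# Keep the row names as a regular column, then stack the genre columns.
wide <- as.data.table(m, keep.rownames = TRUE)
setnames(wide, "rn", "genreA")
books_dt <- melt(wide, id.vars = "genreA",
                 variable.name = "genreB", value.name = "customers")

# melt() returns genreB as a factor; plain strings are easier to compare.
books_dt[, genreB := as.character(genreB)]
print(head(books_dt))
```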
  • Average number of genres per customer
## [1] 2.332187

First ideas

Show me everything!

  • Romance, SciFi, Action, History are most bought
  • bought-together clusters:
    • Romance, SciFi, Action, History
    • Dictionaries and Comics
    • Math and Poetry
  • Mystery is an outlier

Most bought genre

Best pairs

  • mostly combinations of most bought genres
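Ranking the pairs might be done as in the following sketch, which drops the diagonal and keeps each unordered pair once. The data are toy values and `best_pairs` is an illustrative name.

```r
library(data.table)

# Toy long table in the same layout as books_dt (counts are made up).
books_dt <- data.table(
  genreA = rep(c("A", "B", "C"), each = 3),
  genreB = rep(c("A", "B", "C"), times = 3),
  customers = c(10, 3, 2,
                 3, 8, 6,
                 2, 6, 7)
)

# genreA < genreB removes the diagonal and one copy of each symmetric pair.
best_pairs <- books_dt[genreA < genreB][order(-customers)]
print(best_pairs)
```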

Special genres

Hypothesis

  • If customers buy more than 2 genres, each of them is counted in more than 1 off-diagonal entry of a column, so the off-diagonal sum can exceed the diagonal:
    • (2*diagonal - colSum) < 0
  • If a genre is bought more often alone than in triplets (or larger combinations):
    • (2*diagonal - colSum) > 0

Look for customers that buy only one genre

  • Compare the column sum with 2*diagonal for each genre
  • Generate a table of {genre, 2*diagonal - colSum}
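Because the column sum already contains the diagonal, 2*diagonal - colSum equals the diagonal minus the sum of the off-diagonal entries. A minimal sketch of the table on a made-up matrix (`solo_score` and `score_table` are illustrative names):

```r
genres <- c("A", "B", "C")
# Toy symmetric co-purchase matrix (values are made up).
m <- matrix(c(10, 3, 2,
               3, 8, 6,
               2, 6, 7),
            nrow = 3, byrow = TRUE, dimnames = list(genres, genres))

# Positive: bought alone more often than in larger combinations.
# Negative: buyers are counted in several off-diagonal entries.
solo_score <- 2 * diag(m) - colSums(m)

score_table <- data.frame(genre = genres, score = unname(solo_score))
score_table <- score_table[order(-score_table$score), ]
print(score_table)
```

In this toy matrix genre A scores 5 while B and C both score -1.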

  • Mystery and Horror are mostly bought alone
  • Satire and Travel are bought in pairs rather than alone

Normalize columns by diagonal

##    genreA  genreB customers rel_customers
## 1: Action  Satire       343  0.0069304130
## 2: Action   SciFi     44698  0.9031358603
## 3: Action   Drama        23  0.0004647216
## 4: Action  Action     49492  1.0000000000
## 5: Action Romance     15685  0.3169199062
## 6: Action Mystery      1599  0.0323082518

-> rel_customers is the customers value of the (genreA, genreB) pair divided by the genreA diagonal value
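The normalization can be sketched with a grouped `data.table` update, assuming a long table with the columns shown above. The toy values are made up and the original implementation may differ.

```r
library(data.table)

# Toy long table (counts are made up).
books_dt <- data.table(
  genreA = rep(c("A", "B"), each = 2),
  genreB = rep(c("A", "B"), times = 2),
  customers = c(100, 40,
                 40, 50)
)

# Within each genreA group, divide by that group's diagonal entry
# (the row where genreB equals genreA).
books_dt[, rel_customers := customers / customers[genreA == genreB], by = genreA]
print(books_dt)
```

The same 40 shared customers are 40% of the buyers of A but 80% of the buyers of B, so the normalized matrix is no longer symmetric.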

Look at all data unsorted: No pattern.

With clustering of rows and columns (note: after normalization the matrix is no longer symmetric, so rows and columns now differ):

  • 2 hubs on genreA axis (top dendro)
    • Art, Journals, Action, SciFi, History
    • Encyclopedias, Comics, Dictionaries, Poetry, Math, Anthology
    • e.g. genres that were bought with Art were also bought together with Journals
  • 2 hubs on genreB axis (right dendro)
    • Romance, History, Action, SciFi -> Romance instead of Art and Journals
    • same second hub as on the genreA axis
  • Bought with everything else: Romance

Most favorite partner genre

-> about 20% of customers additionally bought SciFi and Romance
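Picking the favorite partner per genre could be sketched as follows, assuming a long table with a `rel_customers` column as above. The values are toy data and `partners` is an illustrative name.

```r
library(data.table)

# Toy normalized long table (values are made up).
rel_dt <- data.table(
  genreA = rep(c("A", "B", "C"), each = 3),
  genreB = rep(c("A", "B", "C"), times = 3),
  rel_customers = c(1.0, 0.4, 0.2,
                    0.8, 1.0, 0.1,
                    0.3, 0.5, 1.0)
)

# Drop the diagonal, then keep the strongest partner for every genreA.
partners <- rel_dt[genreA != genreB,
                   .SD[which.max(rel_customers)],
                   by = genreA]
print(partners)
```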

Relative best pairs

-> Math is poetry and History is Science fiction!